Bioinformatics for Computational Biology — Appunti TiTilda

Introduction

Hello reader, this is a short summary of our notes for the course “BCB”. If you find an error or think that a point isn’t clear, please tell me and I will fix it (sorry for my bad English). -NP

Chapter One: The cell

Cell: the basic unit of living beings.

Cells are divided into:

Prokaryotes: have no nucleus clearly separated from the rest of the cellular matter. They are unicellular organisms.

Structure:

Eukaryotes: enclosed by a plasma membrane that contains:

1.1 The Big Four

All cells are made of 4 main types of macromolecules:

Amino acid structure:

1 central carbon atom (C) linked with:

Nucleic acids:

DNA: contains ALL the genetic information that is necessary for the life of the host organism. It is organized in chromosomes.

Prokaryotes have only one chromosome.

A cell can be haploid, diploid, triploid,…

This means that the number of chromosomes is n, 2n, 3n,…

1.2 Mitosis

Mitosis -> asexual reproduction of a single cell.

“Mitosis”

1.3 Meiosis

Meiosis -> Sexual Reproduction between:

“Meiosis”

Chapter Two: Mendelian Genetics

In 1865 Gregor Mendel formulated the laws of inheritance.

2.1 First Law: Law of dominance

Mendel crossed plants that differed in only one discrete character with two alternative forms: pure dominant (RR) and pure recessive (rr).

The result he obtained was that all four possible offspring (all Rr) displayed only the dominant trait.

2.2 Second Law: law of segregation

“Second generation”

Ratio is 3:1

3 dominant;

1 recessive.

2.3 Third law: law of trait independent segregation

In F2 with 2 traits the ratio is (3:1) * (3:1) = 9:3:3:1

9 with both traits dominant;

3 with the first trait dominant and the second recessive;

3 with the first trait recessive and the second dominant;

1 with both traits recessive.
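The 9:3:3:1 ratio can be checked by enumerating every gamete combination in a cross between two double heterozygotes. The following is a minimal sketch: the genotype string `RrYy`, the second gene `Y`/`y` and the helper names are illustrative assumptions, not part of the course material.

```python
from collections import Counter
from itertools import product


def gametes(genotype):
    """All gametes of a two-gene genotype such as 'RrYy' (one allele per gene)."""
    gene1, gene2 = genotype[:2], genotype[2:]
    return [a + b for a, b in product(gene1, gene2)]


def phenotype(gamete1, gamete2):
    """Phenotype per trait: an uppercase (dominant) allele masks a lowercase one."""
    traits = []
    for a, b in zip(gamete1, gamete2):
        traits.append("dominant" if a.isupper() or b.isupper() else "recessive")
    return tuple(traits)


def cross(parent1, parent2):
    """Count the phenotypes over all equally likely gamete pairings."""
    return Counter(
        phenotype(g1, g2)
        for g1, g2 in product(gametes(parent1), gametes(parent2))
    )


counts = cross("RrYy", "RrYy")
for traits, count in sorted(counts.items(), key=lambda item: -item[1]):
    print(traits, count)
```

The 16 pairings split into 9, 3, 3 and 1 because each trait independently follows the 3:1 ratio of the second law.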

The second law makes evident that:

Linkage: association between genes or groups of genes.

Thomas H. Morgan demonstrated that the association among different genes of a chromosome exists, but is not total.

Crossing-over: during meiosis, homologous chromosomes can exchange genetic material.

1 centimorgan \implies 1\% of recombinant chromosomes in the offspring; it is used as a measure of the relative distance between a pair of genes.

The percentage of recombination between two genes is proportional to their relative distance.

2.4 Sexual characters

Inheritance of genetic characters has particular importance for genes located on sexual chromosomes.

Sexual chromosomes: X and Y

In humans, genes located on X and Y are associated with sex.

There are traits influenced by sex and traits limited by sex.

Some pathologies linked to genetic alterations on X or Y:

For genetic transmission, the normal allele is dominant and the mutated one is recessive.

Genes can interact, generating unpredicted phenotypes and anomalies, like:

Phenotype: group of observable characteristics.

Phenotype = genotype + environment

The same genotype can express different phenotypes. Example: a female bee, with the same chromosome complement as the others, becomes a queen bee only if fed with royal jelly; otherwise it becomes a worker bee.

1908, G. Hardy and W. Weinberg \implies in a population in equilibrium, the frequencies of genes and genotypes tend to remain constant.

Given the distribution of the paired alleles A and a, we want to know their relative frequencies.

Pairs: AA, Aa, aa -> the frequency of A is not immediately known because A is present both in AA and in Aa.

p: frequency of A.

q: frequency of a.

p + q = 100\% = 1 \implies p = 1 - q \iff q = 1 - p

AA: p^2

Aa: 2pq

aa: q^2

So AA + 2Aa + aa = (A+a)^2 = 1, that is:

p^2 + 2pq + q^2 = (p+q)^2 = 1.

We can now show that the allele frequencies remain constant in the next generation.

Since AA has frequency p^2 and Aa has frequency 2pq (and only half of the alleles in Aa are A), we can say that f_A = p^2 + pq; but we know that q = 1 - p, so:

\begin{align*} f_A = p^2 + p(1-p) = p \end{align*}

The same reasoning applies to f_a, which gives q.

When this happens we say that the population is balanced.
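The argument can be checked numerically. This is a small sketch under the assumptions of the model above (random mating, no selection); the function names and the value p = 0.7 are hypothetical choices for illustration.

```python
def genotype_frequencies(p):
    """Hardy-Weinberg genotype frequencies for allele frequency p of A."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def allele_frequency_A(frequencies):
    """Frequency of A in the next generation: all of AA plus half of Aa."""
    return frequencies["AA"] + frequencies["Aa"] / 2


frequencies = genotype_frequencies(0.7)
next_p = allele_frequency_A(frequencies)
print(frequencies, next_p)
```

Up to floating-point rounding, the genotype frequencies sum to 1 and the allele frequency recovered from them equals the starting p, which is the constancy stated by Hardy and Weinberg.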

Fitness: The measure of reproductive ability of an individual.

Chapter Three: Molecular genetics

3.1 DNA

DNA (Deoxyribonucleic Acid)

DNA Chain: long sequence of nucleotides linked by the bond between the phosphoric acid of a nucleotide and the sugar of the subsequent nucleotide.

This bond is called the 3'-5' bond.

In 1953, Watson and Crick defined the exact spatial structure of DNA considering two experimental results:

Around 1950 Chargaff found that, in the DNA of each organism:

In the early 1950s Rosalind Franklin and M. Wilkins obtained the first X-ray diffraction photographs of crystals of pure DNA fibers, which showed:

From this information:

bp: base pair \implies number of base pairs before the structure repeats regularly

DNA structure types:

Each human cell \to about 1 m of DNA \implies packaging, a mechanism that tightly compresses the DNA.

3.2 RNA Structure

RNA is the protagonist in the synthesis of proteins.

Structural differences between RNA and DNA:

In some organisms without DNA, RNA plays a leading role in the reproduction process.

In cells, there are different types of RNA:

Of all these, mRNAs, rRNAs and tRNAs play important roles in protein synthesis; the others in regulation.

3.3 Virus

Genome: genetic material of an organism.

Generally indicates DNA, often also RNA and proteins.

Viruses: the simplest life forms -> cellular parasites.

They must use another cell to reproduce themselves.

Virus genome: 1 molecule of nucleic acid (DNA or RNA) enclosed in a protein shell (capsid) with different shapes.

Viruses can be divided into 3 classes:

Bacteriophages: capsid with an icosahedral head, containing the genetic material, connected to a hollow cylinder (tail) to which filamentous structures (spikes) are attached; these allow the virus to attach to the bacterial cell’s wall.

Once attached, the virus injects its genetic material into the cell, where it reproduces.

3.3.1 Virus of eukaryote cells

Cellular transformation: phenomenon resulting from the integration of the virus into the host cell’s DNA.

3.3.2 Retrovirus

Retrovirus: e.g. Human Immunodeficiency Virus (HIV).

\begin{align*} \text{Viral RNA} \xrightarrow{\text{reverse transcriptase}} \text{Viral DNA} \end{align*}

The viral DNA is then converted into a double helix by another enzyme active in the cell nucleus, DNA polymerase.

\implies the viral genome can integrate into the host cell’s genome and reproduce itself.

3.4 Bacterial genome

Bacterial cells (prokaryotes) don’t have a defined nucleus, but a compact structure (nucleoid):

In many bacteria, there are also small circular molecules of DNA \to plasmids

There are different types of plasmids:

3.5 Genome of Eukaryote

Chromosomes \xrightarrow{constituted} chromatin \implies

Proteins tightly bound to DNA are called histones.

Chromatin is structured like a necklace:

Context: cellular division

Metaphase: Chromosomes assume the X shape:

The position of centromere, length of chromatids and dimension of chromosomes identify different chromosomes -> karyotype of the organism.

It is possible to highlight the areas rich in A and T bases,

while the areas rich in C and G remain pale.

This generates a striped arrangement -> bands

Nomenclature: Each band has a specific nomenclature (e.g. 6p21.3)

Not all the genetic material of eukaryotes is in the nucleus.

A small fraction of circular DNA is in cellular organelles:

Extra-nuclear genes in the cytoplasm are transmitted to the offspring only by the ovum -> maternal inheritance

To guarantee the transmission, the DNA is copied -> duplication

The process is the same in eukaryotes and prokaryotes: semi-conservative -> each daughter cell has 1 DNA strand from the mother and 1 newly synthesized strand.

Semi-conservative duplication:

The 2 double helices consist of:

Replication fork: Area where double helix opens and synthesis starts.

Duplication proceeds at one specific position at a time (not concurrently):

  1. A specific enzyme locally opens the double helix.

  2. Copying: pairing of bases and polymerization of nucleotides.

    \xrightarrow[\text{enzymes}]{} DNA polymerases (such as DNA polymerase Ⅰ) \implies both directions

    In the 5' to 3' direction, copying is continuous.

    In the 3' to 5' direction, copying is discontinuous: Okazaki fragments, segments of DNA that are synthesized separately, are joined into one strand by the DNA ligase enzyme.

  3. Re-closing of the double helix by specific enzymes.

3.6 Gene structure

Codon: triplet of bases.

But how is a gene structured?

Start codon: specific triplet that marks the beginning of the gene.

Stop codon: specific triplet that marks the end of the gene.

Human genome -> 3000 Mb (megabases) \implies 22k - 25k genes.

Central Dogma of Molecular Biology (Crick 1958)

Transcription: DNA to RNA, from 5' to 3' \implies only one strand of the DNA helix is used -> template strand

Synthesis is catalyzed by the enzyme RNA polymerase -> different in prokaryotes and eukaryotes.

3 phases:

  1. Start of transcription

    • RNA polymerase binds (with its \sigma factor) to the gene’s promoter.
    • \sigma detaches and transcription starts
  2. Polymerization of the RNA polynucleotide -> elongation

  3. Detachment of the synthesized RNA and end of transcription.

Transcription factors: proteins;

Specific transcription factors are necessary for each polymerase.

RNA polymerase synthesizes RNA in a continuous way.

Splicing: in eukaryotes, removal of introns from pre-RNA -> mature RNA

There are different types of splicing:

In a gene with alternative splicing, the majority of exons are always included in the final mRNA.

There are 4 types:

Splicing is not a fully understood process.

In some cases, small nuclear ribonucleoproteins (snRNPs) cut:

  1. The 5' end of the intron at the dinucleotide GU
  2. The 3' end of the intron at the dinucleotide AG
  3. The exons are then joined together.

Some introns follow the GU-AG rule without snRNPs -> self-splicing -> these RNAs are called ribozymes

After splicing -> the mRNA is stabilized by adding:

3.7 Different types of RNA

Structural and functional components of ribosomes, where the synthesis of proteins occurs.

rRNA, tRNA and mRNA participate in translation.

3.8 Genetic Code

Genetic code: set of rules defining how the information in the nucleotide sequence of mRNA (4 bases: A, G, C, U) is translated into the amino acid sequence of the encoded protein (20 amino acids).

Each codon:

Features of the genetic code

This characteristic of triplets means that concurrent encodings may occur, caused by splicing and mostly by alternative splicing.

Some alternative transcripts are tissue-specific \implies expressed only in one specific type of cell.

Mechanisms of genetic code and alternative splicing allow encoding and production of many proteins with different functions from the same DNA.

3.9 Translation

Translation: complex process involving many cellular components: rRNA, mRNA and tRNA.

tRNAs are adaptors between the nucleotides of mRNA and the amino acids of the protein:

Translation is the same in prokaryotes and eukaryotes and has 3 phases:

  1. Start
    • The ribosome binds to the mRNA at the start triplet (AUG);
    • Identification of the mRNA’s AUG triplet by the complementary specific tRNA triplet (anticodon)
    • Binding of the tRNA that carries the amino acid corresponding to the AUG triplet (Met)
  2. Synthesis:
    • Process goes on;
    • The ribosome moves along the mRNA;
    • Only 1 triplet at a time is available for binding to a specific tRNA;
    • The amino acids carried by the tRNAs are brought close together;
    • When the ribosome moves, a peptide bond is created between the last amino acid transported and the previous one.
    • The protein chain extends as the ribosome moves.
  3. End:
    • When ribosome reaches a stop triplet (UAA, UAG, UGA)
      • Detaches from mRNA.
      • Sets protein chain free.
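The three phases can be sketched as a loop over codons. This is only an illustration: `CODON_TABLE` is a deliberately partial table containing just the codons used in these notes (a complete table has 64 entries), and the example mRNA strings are made up.

```python
# Partial codon table (assumption: only the codons mentioned in these notes).
CODON_TABLE = {"AUG": "Met", "CCU": "Pro", "CCC": "Pro", "AAG": "Lys", "GAG": "Glu"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def translate(mrna):
    """Translate an mRNA string from the first AUG up to the first stop codon."""
    start = mrna.find("AUG")  # Start: locate the start triplet
    if start == -1:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):  # Synthesis: one triplet at a time
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:  # End: the chain is released
            break
        protein.append(CODON_TABLE.get(codon, "???"))  # "???" = codon not in the partial table
    return protein


original = translate("GGAUGCCUAAGUAAGG")
mutated = translate("GGAUGCCCGAGUAAGG")
print(original, mutated)
```

In the second string two single-base changes were introduced: CCU to CCC leaves Pro unchanged, while AAG to GAG replaces Lys with Glu.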

Each ribosome builds only 1 protein at a time.

In bacteria (prokaryotes), which require the synthesis of many copies of the same protein in a short time (minutes):

In bacteria transcription and translation are coupled.

Central Dogma

Not all genes are always necessary for the life of a cell -> only constitutive ones (necessary for the life of the cell) are always expressed; the others are expressed when necessary.

3.10 Genetic Expression

Expression of genes is controlled by cellular needs: environmental conditions and functions to execute.

Multi-cellular organisms:

Bacteria

François Jacob and Jacques Monod (1960 - 64) studied the use of lactose in E. coli.

Lactose is a disaccharide (sugar of 2 monomers, glucose and galactose) that can be utilized when divided into the 2 components inside the cell.

Splitting of lactose is carried out by enzymes encoded by 3 genes:

In the absence of lactose, the cell contains \cong 5 molecules of each enzyme.

If lactose is the only source of energy, synthesis of the enzymes is rapidly stimulated \implies inducible enzymes

Genes lacZ, lacY and lacA -> structural genes; they are consecutive on the bacterial chromosome and transcribed into the same mRNA.

Before these three genes there is lacI, which regulates them; its elimination brings continuous synthesis of the 3 enzymes.

Mechanism of regulation

The repressor bound to the operator prevents RNA polymerase from transcribing the 3 structural genes.

If lactose is present, it binds to the repressor and changes its 3D conformation, preventing it from binding to the operator.

When lactose is totally consumed, the repressor binds again to the operator and synthesis stops.

Higher organisms

Main mechanisms are similar but regulation is more complex.

Gene expression is regulated by proteins, the transcription factors, which bind DNA sites upstream of the gene (Transcription Factor Binding Sites) and can allow or block the binding of RNA polymerase to the promoter of the gene.

Example

The protein metallothionein protects cells from the toxic effect of metals free in the environment:

The gene of metallothionein is transcribed by RNA polymerase II

Many stretches of DNA upstream of the gene are involved in its expression:

Such zones, the metal response elements, modulate transcription based on the concentration of metals.

Transcription factors have a leading role in regulation:

Most common structures \implies helix-turn-helix and zinc-finger

3.11 Proteins

Proteins: macro-polymers made by linking amino acids (minimum 3); there are 20 amino acids:

Peptide: short polymer of amino acids linked by peptide bonds.

Peptide bond: bond between the carboxyl group (C-terminus side) of one amino acid and the amino group (N-terminus side) of the next -> planar and rigid \implies no free rotation around the bond

Polypeptides: have 1 free N-terminus (beginning) and 1 free C-terminus (end) -> they contain from 3 to several hundred amino acids.

Proteins have different functions, essential for all organisms:

The function executed by a protein depends on the properties of the protein, determined by:

3.12 Proteins’ structure

Proteins are structured in 4 related levels:

Isoforms: two proteins that differ in small details, due to alternative splicing or to polymorphisms.

3.13 Genetic mutations

During DNA duplication, variations in the sequence of nucleotide bases (mutations) can occur and are transmitted to the offspring (mutants):

Single Nucleotide Polymorphism, or SNP: the variation of 1 single nucleotide in an individual’s DNA sequence.

\begin{align*} CCU \to \text{Pro} \xrightarrow{SNP} CCC \to \text{Pro} &\quad \text{(in this case nothing changes)} \\ AAG \to \text{Lys} \xrightarrow{SNP} GAG \to \text{Glu} &\quad \text{(in this case the SNP changes the synthesized protein)} \end{align*}

SNPs are likely to be good biological markers.

3.14 Types of genetic mutations

3 classes of mutations:

3.14.1 Genetic Mutations

Genetic mutations can derive from different alterations:

In the substitution scenario a situation like A-T \xrightarrow{substitution} G-T is possible; in this case G-T is an unstable pair, and in the next replication we can have G-C or A-T.

3.14.2 Chromosomal mutations

Chromosomal mutations: changes in chromosomal structure compared to normal karyotype

Main types of chromosomal anomalies:

Less harmful than deletion

!!! In humans, involved in tumor onset !!!

3.14.3 Genomic Mutations

Genomic mutations concern total number of chromosomes in each cell of an individual

Example:

They derive from errors in the meiotic process, like non-disjunction in a pair of homologous chromosomes:

3.15 Mutagens

The frequency of mutations can increase if the organism is exposed to substances and radiation (mutagens) that interact with DNA and can induce changes in the nucleotide sequence.

3.16 Fixing DNA and Genome

All living beings have various cellular mechanisms for repairing DNA damage:

In humans, the lack or reduction of one or more of the enzymes involved is associated with an inherited pathology that leads to the formation of skin tumors due to the ultraviolet radiation present in sunlight.

Genome: Entire genetic material of an organism

In bioinformatics, genomic data/information: the set of available data and information related to the genetic material of an organism.

Transcriptome: the set of all possible transcripts of an organism.

Proteome: the set of all possible proteins of an organism, deriving from different transcripts.

3.17 Evolutionary biology

Evolutionary biology is a sub-field of biology regarding the origin of species from a common ancestor, as well as their changes, multiplications and diversifications over time.

Then:

Today:

Sequence regions that are homologous are also called conserved

Sequence homology may indicate common function

Homologous sequences are called orthologous if they were separated by a speciation event.

Homologous sequences are called paralogous if they were separated by a gene duplication event.

So, homologous sequences can be divided into two groups:

Phylogenesis or phylogenetics: study of the evolution of life

Taxonomy: classification of organisms depending on similarities.

Phylogenetic trees: diagrams that show the relations of common descent among taxonomic groups of organisms.

Computational phylogenetics: concerns the construction of phylogenetic trees and the study of the anatomic, biochemical, genetic and paleontological data used for their construction.

Phylogenetic trees are built on the basis of a high number of genetic sequences.

Many techniques are used to identify the best tree -> complexity class NP (nondeterministic polynomial time)

Phylogenetic trees are important but have some limits:

Chapter Four: Biomolecular Sequence Analysis

Why do we compare sequences?

Two types of alignment:

Different techniques:

4.1 Alignment of 2 sequences

4.1.1 Dot matrix

The simplest technique: we build a matrix with sequence 1 along the columns and sequence 2 along the rows, and we put an “x” in a cell when the two characters match.
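A dot matrix takes only a few lines of code. In this sketch the function name, the "." used for empty cells and the two example sequences are illustrative choices.

```python
def dot_matrix(seq1, seq2):
    """One string per row: seq1 runs along the columns, seq2 along the rows."""
    return ["".join("x" if a == b else "." for a in seq1) for b in seq2]


rows = dot_matrix("ATGC", "ATCC")
print("  " + "ATGC")
for label, row in zip("ATCC", rows):
    print(label, row)
```

Runs of "x" along a diagonal correspond to stretches where the two sequences match without gaps.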

Filtering of background noise

Pros:

Cons:

Practice

4.1.2 Pairwise alignment

Also simple; it is based on 3 types of action:

We write the two sequences and compare.

-: gap

C \\ | \\ C: match

G \\ | \\ C: mismatch

We can assign a score, using:

gap = -2

mismatch = -1

match = +2

The higher the score, the better the alignment.
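With these three values, the score of an alignment that has already been written out is a sum over its columns. A minimal sketch follows; the function name and the example rows are assumptions made for illustration.

```python
def alignment_score(row1, row2, match=2, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length, column by column."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("ATGC", "ATCC"))    # 3 matches, 1 mismatch
print(alignment_score("ATGC-", "AT-CC"))  # same sequences, aligned with 2 gaps
```

The first alignment scores 2 + 2 - 1 + 2 = 5, the second 2 + 2 - 2 + 2 - 2 = 2, so the first is preferred.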

Distance between two strings:

Now we discuss the scores assigned to gaps, mismatches and matches. Why?

Because, biologically, substitutions cannot all be considered equal.

We have to consider:

4.1.3 Substitution matrix

Substitution matrix: assigns a value to each possible pair of characters

There are two main types of matrices:

PAM

PAM (Point Accepted Mutation) matrices: developed in the late 70s by looking at mutations in closely related superfamilies of amino acid sequences.

Accepted Mutation: accepted by evolution.

For the construction of PAM matrices homogeneous blocks of aligned sequences are considered.

To avoid the problem of multiple substitutions, very similar sequences are chosen to determine PAM matrix:

For each amino acid (j), count the number N_{jk} of changes into another amino acid (k)

Normalize by dividing by the total number of changes (\sum_{m} N_{jm}, 1 \leq m \leq n)

n = number of amino acids = 20

A_{jk} = \frac{N_{jk}}{\sum_m N_{jm}}

PAM contains the log-odds (p) of the transition of each amino acid into another amino acid

p = log (odd(P))

odd(P) = \frac{P}{1-P} \implies p = log (\frac{P}{1-P})

If PAM_{i,j} > 0, likely transition of i into j

If PAM_{i,j} = 0, random transition of i into j

If PAM_{i,j} < 0, unlikely transition of i into j

The classical PAM expresses the probability of change in one step \implies PAM1

If we want two steps: PAM1 * PAM1 \implies (PAM1)^2 \implies PAM2

In ten: (PAM1)^{10} = PAM10

The number is the percentage of change: PAM2 \implies 2\%

The number identifies the evolutionary step: 1 step is the change of 1 amino acid out of 100.

PAM250 is the most used; at this level the amino acid sequences maintain 20\% similarity.

Example

Take PAM250(F \to Y) = 0.15

Divide by the frequency of changes into F (0.04) and take the logarithm: log_{10}(\frac{0.15}{0.04}) = 0.57

Likewise for Y \to F: log_{10}(\frac{0.2}{0.03}) = 0.82

Calculate the score for a change F,Y as 10 * \frac{(0.82 + 0.57)}{2} \approx 7
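The arithmetic of this example can be reproduced directly. The sketch below only uses the four numbers given above; the function and variable names are hypothetical.

```python
import math


def log_odds(change_probability, frequency):
    """Base-10 logarithm of a change probability divided by a frequency."""
    return math.log10(change_probability / frequency)


f_to_y = log_odds(0.15, 0.04)
y_to_f = log_odds(0.20, 0.03)

# Symmetric score: 10 times the average of the two directions.
score = 10 * (f_to_y + y_to_f) / 2
print(round(f_to_y, 2), round(y_to_f, 2), round(score))
```

Averaging the two directions makes the matrix symmetric, so the score for F,Y equals the score for Y,F.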

We will obtain something like this:

“PAM250 log odds”

BLOSUM

BLOSUM (BLOcks SUbstitution Matrix): substitution matrix for amino acids

A block is a highly conserved region without gaps

How is the matrix calculated?

For each pair of amino acids x and y, calculate the ratio between the observed likelihood that x and y are aligned and the likelihood (e_{xy}) that x and y are aligned by chance

Example

Values calculated based on the substitutions in a set of 2000 conserved patterns

To prevent very similar sequences in a block from biasing the estimation, clusters are created in the block.

To find relationships between sequences that are evolutionarily close in time, a large n is used.

BLOSUM62 is the standard.

BLOSUM vs PAM

So the gap penalty is g = g_o + g_e * (l-1)

where g_o is the gap opening penalty, high

g_e is the gap extension penalty, lower

l is the length of the gap block

In the course exercises we use a linear gap penalty for simplicity.

Real software, instead, uses the gap penalty above.

This is because in biology a mutation event is more likely to produce one long gap than many scattered gaps.
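The effect of the formula g = g_o + g_e * (l-1) is easy to see with numbers. In this sketch the values g_o = 10 and g_e = 1 are hypothetical, and penalties are expressed as positive magnitudes to be subtracted from the alignment score.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """g = g_o + g_e * (l - 1) for a gap block of the given length."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_penalty(5)             # one block of 5 gap positions
five_short_gaps = 5 * affine_gap_penalty(1)      # five separate blocks of length 1
print(one_long_gap, five_short_gaps)
```

With these values one block of length 5 costs 14, while five scattered single gaps cost 50, so the alignment with a single long gap is favored.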

4.1.4 Needleman-Wunsch

This method is an algorithm, and an optimal one.

This algorithm is also complete.

Optimal algorithm: it finds the best solution, if it finds one.

Complete algorithm: if a solution exists, the algorithm always finds one.

How does it work?

First, penalties and rewards:

Build the matrix like the dot matrix, but with an additional row and column.

Start the matrix with a 0 in the (0,0) cell, then write the scores; there are 3 possible movements:

We complete the matrix by filling each cell with the score of the movement that maximizes it.

We start by filling row 0 and column 0 with “gap movements”.

After filling the matrix we have the best score in the last cell, (4,4) in this case; we then backtrack the movements and obtain the solution(s).

Solution:

ATGC \\ ATCC

Score: 5

In general we use a PAM or BLOSUM matrix for the scores.

We can obtain more than one solution; in this case we write all of them.

This algorithm is used for global alignment, that is, to find the best alignment over the whole sequences.
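The procedure can be sketched in a few lines. This version assumes the scores of section 4.1.2 (match +2, mismatch -1, gap -2) and a linear gap penalty; for brevity it backtracks only one optimal solution instead of listing all of them. The function name is an illustrative choice.

```python
def needleman_wunsch(s1, s2, match=2, mismatch=-1, gap=-2):
    """Global alignment: returns (best score, aligned s1, aligned s2)."""
    n, m = len(s1), len(s2)

    def sub(i, j):
        return match if s1[i - 1] == s2[j - 1] else mismatch

    # Matrix with one additional row and column; (0,0) starts at 0.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # column 0: only gap movements
    for j in range(1, m + 1):
        score[0][j] = j * gap  # row 0: only gap movements
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # diagonal: match or mismatch
                score[i - 1][j] + gap,            # vertical: gap in s2
                score[i][j - 1] + gap,            # horizontal: gap in s1
            )

    # Backtrack from the last cell, preferring the diagonal movement.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a1)), "".join(reversed(a2))


print(needleman_wunsch("ATGC", "ATCC"))
```

For ATGC and ATCC the last cell holds 5 (three matches and one mismatch) and the backtrack gives the alignment without gaps.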

4.1.5 Smith - Waterman algorithm

This algorithm is for local alignment, that is, to find the best-matching sub-sequences.

It is identical to the Needleman-Wunsch algorithm with one crucial difference: there are no negative scores (a negative value is replaced by 0). So the same matrix above becomes:

So in this case the best sub-sequence is

ATGC \\ ATCC

Score: 5

It is possible that multiple sub-sequences share the highest score; in that case we have to write all of them. REMEMBER: if a score goes to zero, that is a reset point, so the next score belongs to another sub-sequence; if instead a score decreases but does not reach zero, it is still the same sub-sequence, as in the example.
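A minimal sketch of the local variant follows, again assuming match +2, mismatch -1 and gap -2. It reports only the first best-scoring sub-alignment found; a complete solution would collect every cell that holds the maximum. The function name is hypothetical.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=-2):
    """Local alignment: returns (best score, aligned part of s1, aligned part of s2)."""
    n, m = len(s1), len(s2)

    def sub(i, j):
        return match if s1[i - 1] == s2[j - 1] else mismatch

    h = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay at 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(
                0,                            # reset point: no negative scores
                h[i - 1][j - 1] + sub(i, j),
                h[i - 1][j] + gap,
                h[i][j - 1] + gap,
            )
            if h[i][j] > best:
                best, best_cell = h[i][j], (i, j)

    # Backtrack from the best cell until a zero (reset point) is reached.
    a1, a2 = [], []
    i, j = best_cell
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + sub(i, j):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(smith_waterman("ATGC", "ATCC"))
print(smith_waterman("GGGATCTTT", "CCCATCAAA"))
```

In the first call the score drops from 4 to 3 at the G/C mismatch without reaching zero, so the sub-sequence continues up to the final match. In the second call only the shared ATC core is returned.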

For ANY pairwise alignment, the measures used are:

For the z score, a value \geq 5 suggests that the alignment found between the two sequences is significant.

Probability p is obtained by: p = 1 - e^{-kmne^{-\lambda S}}
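The formula can be evaluated as follows. This is only a sketch: m and n are taken here to be the lengths of the query and of the database, and the numeric values of k and \lambda are placeholders, since in real tools they depend on the scoring system.

```python
import math


def expected_hits(score, m, n, k, lam):
    """The exponent of the formula: k * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def p_value(score, m, n, k, lam):
    """p = 1 - exp(-k*m*n*exp(-lambda*S)); expm1 keeps precision for tiny values."""
    return -math.expm1(-expected_hits(score, m, n, k, lam))


# Hypothetical parameters, chosen only to show the trend.
p_low_score = p_value(score=20, m=100, n=1_000_000, k=0.1, lam=0.3)
p_high_score = p_value(score=80, m=100, n=1_000_000, k=0.1, lam=0.3)
print(p_low_score, p_high_score)
```

A higher score S gives a smaller p, and when the exponent is small p is approximately equal to the exponent itself.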

Guide for the E score:

4.2 Database Search

The classic programs that search for sequences in databases are FASTA and BLAST

The heuristic principle that these programs use is the search for “words” in databases.

Word: short series of characters in the sequences of amino acids or nucleic acids.

These words are indicated with the term k-tuple -> k = number of characters.

4.2.1 FASTA

FASTA (FAST-All) is a heuristic program that can search for global homology of sequences.

Two variants exist that can search for local homology:

FASTA is specific but not very sensitive.

4 phases:

Initially, create a positional table containing all the positions for each amino acid (or nucleotide) in the query sequence and in each sequence in the database

The table can be built considering positions individually (1-tuple) or in pairs (2-tuple). For nucleotides, 4-tuples or 6-tuples are used.

Calculate the difference between the positional values of each amino acid in the query and in the database sequence.
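The positional table and the position differences can be sketched as follows. This illustrates only this first phase, not the FASTA program itself; the function names and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict


def positional_table(sequence, k=1):
    """Map every k-tuple of the sequence to the list of its start positions."""
    table = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        table[sequence[i:i + k]].append(i)
    return table


def offset_counts(query, target, k=1):
    """Count, for each position difference, how many k-tuples match."""
    query_table = positional_table(query, k)
    offsets = Counter()
    for j in range(len(target) - k + 1):
        for i in query_table.get(target[j:j + k], []):
            offsets[i - j] += 1  # same difference = same diagonal of the dot matrix
    return offsets


offsets = offset_counts("ATGC", "ATCC", k=1)
print(offsets.most_common())
```

Differences that occur many times point to a diagonal with many matching words, that is, a candidate region of similarity.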

The best 10 regions (best 10 subsequences) selected are evaluated through the score matrices; the sub-regions containing the residues that maximize the region score are identified.

The aim is finding the initial region with the best score, to be used to create a rank of the sequences in the database, in order to define which of them are the most similar to the query sequence.

FASTA evaluates if it’s possible to join together different regions of similarity.

Constraints to create the join:

Sequences with higher similarity are aligned to the query sequence using the procedure based on a modified Smith-Waterman algorithm \implies optimized score (OPT)

4.2.2 BLAST

BLAST (Basic Local Alignment Search Tool)

It searches for best local alignment between a query sequence and the sequences in a database

Features:

While FASTA searches all possible words of the same length, BLAST limits the search to the most significant words using a preventive filter.

For score, in case of protein it uses BLOSUM62

BLAST fixes the length of the word to:

3 Phases:

It generates a list of words of length W from the query sequence.

For each word, a score is assigned to each of the 20^3 possible words.

Use a threshold T to limit the number of analogous words.

The (exact) search for the best analogous words in the sequences of the database is performed.

When searched analogous words are found in database’s sequences, they identify regions of possible local alignment (without gap) between the query sequence and the sequences found in the database

The algorithm tries to extend the aligned regions, without allowing gaps, as long as the score of the extended alignment does not decrease \implies High-scoring Segment Pairs (HSP)

An HSP is considered relevant if it exceeds a threshold value S.

Important: at the end, it generates the best alignment according to the Smith-Waterman algorithm.

Variation of E and p

p-value

Filters

BLAST vs FASTA

Both heuristic

They don’t guarantee finding the best alignment

Variants of BLAST

4.3 Information

Motifs are regular combinations of protein secondary structures associated with particular functions.

So same motif \implies similar function

Search for protein motifs can identify new genes and study the diffusion of specific motifs in different genomes.

Uses of the search for protein motifs

4.4 Multiple Alignment

Why ?

Pairwise alignment allows:

Multiple alignment allows:

Formal definition:

A multiple alignment associates with S_1, ..., S_k the sequences S_1', ..., S_k', with S_i' \in (\Sigma \cup \{-\})^* for 1 \leq i \leq k, so that:

Profiles

Profiles are useful structures for summarizing the common properties of groups of sequences and they are the basis of many methods of multiple sequence alignment

Example:

Shannon Entropy

Given a probability space (S,p), the entropy H is a measure of the dispersion of the probability function of the objects in the space S

\begin{align*} H = - \sum_{i=1}^m p_i log_2 p_i \end{align*}

Information content

Given a matrix of weights that models a sequence alignment, you can determine the information content I(k) for each alignment position k:

\begin{align*} I(k) = log_2(m) - (H(k) + e(n)) \end{align*}

The alignment logo shows the information content of each position of the multiple alignment.

Usefulness of profile extraction

Databases of profiles/patterns:

To align a sequence to a profile, we use Needleman-Wunsch but with a different scoring function.

\begin{align*} \sigma_{sp} (b,i) = \sum_{a \in \Sigma} P_{i,a} \sigma (a,b) \end{align*}

To align two profiles -> \sigma_{pp} (i,j) = \sum_{k=1}^{|\Sigma| + 1} f(P_{i,k}', P_{j,k}'')

So different multiple alignments are possible \implies we need a standard score to be able to compare them:

The most used function is the Sum-of-Pairs score, the sum of the scores of the pairwise alignments induced by the multiple alignment:

\begin{align*} \sigma (s) = \sum_{i=1}^{z-1} \sum_{j=i+1}^z S(s_i, s_j) \end{align*}

S(s_i,s_j) is the score of alignment of pairs of sequences s_i and s_j induced by multiple alignment M.

z is the number of sequences in the multiple alignment.

Example

S1: ACTCT \\ S2: A-TTT \\ S3: A-TTT \\ \sigma(s) = S(s_1,s_2) + S(s_1,s_3) + S(s_2, s_3) = 3 + 3 + 6 = 12
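The example can be reproduced in code. The scoring used here is an assumption: match +2, mismatch -1, and -2 for every column that contains a gap (including a column where both rows have a gap), which is the choice that yields the values 3, 3 and 6.

```python
from itertools import combinations


def pair_score(row1, row2, match=2, mismatch=-1, gap=-2):
    """Score of the pairwise alignment induced by two rows of the multiple alignment."""
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap  # assumption: a gap/gap column is also penalized
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(rows):
    """Sum the induced pairwise scores over all pairs i < j."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


alignment = ["ACTCT", "A-TTT", "A-TTT"]
print(sum_of_pairs(alignment))
```

`combinations(rows, 2)` visits each unordered pair exactly once, which matches the double sum over i < j.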

Other scoring functions:

Entropy

H(A) = \sum_{c \in A} H(c)

H(c) = - (\sum_{x \in \Sigma} p_x log_2 p_x)

c: column of the alignment A

p_x: frequency of the symbol x in column c.

Example

ACT \\ ACA \\ A-T \\ H(1) = - (\frac{3}{3} log_2 \frac{3}{3} + 0 + 0 + 0 + 0) = 0 \\ H(2) = - (0 + \frac{2}{3} log_2 \frac{2}{3} +0 + 0 + \frac{1}{3} log_2 \frac{1}{3}) = 0.92 \\ H(3) = - (\frac{1}{3} log_2 \frac{1}{3} + 0 + 0 + \frac{2}{3} log_2 \frac{2}{3} + 0) = 0.92 \\ H(A) = 0 + 0.92 + 0.92 = 1.84
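The same computation in code, treating the gap as one more symbol of the column as the example does. The function names are illustrative.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbol frequencies in one column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropy(rows):
    """H(A): sum of the entropies of all columns of the alignment."""
    return sum(column_entropy(column) for column in zip(*rows))


rows = ["ACT", "ACA", "A-T"]
print([round(column_entropy(c), 2) for c in zip(*rows)])
print(round(alignment_entropy(rows), 2))
```

A fully conserved column contributes 0, so a lower total entropy indicates a better conserved alignment.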

Circular Sum

CS(A) = \frac{1}{2} \sum_{i=1}^z MPA(a_i, a_{i+1}), with a_{z+1} = a_1

We compute the pairwise scores directly, using:

match: +1

mismatch/gap: -1

ACA \\ ACC \\ AT- \\ MPA(a_1,a_2) = 1 + 1 - 1 = 1 \\ MPA(a_2,a_3) = 1 - 1 - 1 = -1 \\ MPA(a_3,a_1) = 1 - 1 - 1 = -1 \\ CS(A) = \frac{1}{2}(1 - 1 - 1) = -\frac{1}{2}

Sum-of-Pair vs Circular Sum

Sum-of-pair is clearly inefficient from an evolutionary point of view

4.4.1 Dynamic programming

Now, if we have 2 sequences we need a 2D matrix, with 3 sequences a 3D matrix, with n sequences an nD matrix. This approach is very complex: the problem is NP-complete (NP: nondeterministic polynomial time) \implies very hard and very time-consuming.

Example: with 10 sequences, each of length 100, we have 100^{10} = 10^{20} elements \implies 100 million terabytes

Solution \implies Heuristic and approximations.

Heuristic methods:

4.4.2 Progressive alignment

Simple and the most common

Idea: we align 2 sequences, then we align another 2, and so on; then we align the 2 aligned pairs with each other, or with a still unaligned sequence.

Like MERGE-SORT

Heuristic: similarity degree

So we align the most similar ones first, until no unaligned sequence remains.

Feng - Doolittle

Algorithm that implements progressive alignment heuristics:

4.4.3 Star-Center

Given a set S of z sequences, we define central sequence S_C \in S the sequence that minimizes the function:

\sum_{S_j \in S} D(S_C, S_j)

that is, the sum of the distances of all the sequences from S_C is the minimum possible.

Then we use Sum-of-Pairs
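Choosing the central sequence is a direct minimization. In this sketch the distance D is assumed to be the Hamming distance between sequences of equal length, purely for illustration; in practice D would typically come from pairwise alignment scores. The names and the toy sequences are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def center_sequence(sequences, distance=hamming):
    """Return the sequence minimizing the sum of distances to all the others."""
    return min(sequences, key=lambda c: sum(distance(c, s) for s in sequences))


sequences = ["ACGT", "ACGA", "TCGA"]
center = center_sequence(sequences)
print(center)
```

Here ACGA has total distance 1 + 1 = 2 from the other two sequences, against 3 for each of the others, so it is chosen as the center.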

4.4.4 Iterative Alignment

It starts by aligning the closest pair of sequences according to a certain definition of distance (not the same as in progressive alignment).

Then, at each step it takes the sequence with the minimum distance from all sequences already aligned and it aligns it to the alignment profile already created.

If needed, new gaps “-” are created.

4.5 Multiple alignment tool

ClustalW is the most popular tool for multiple alignment.

Then it builds a phylogenetic tree: it treats the first pair as a single sequence, builds another similarity matrix, and goes on until the tree is finished.

The result is a tree with branches of length proportional to the distance between the sequences \implies dendrogram

Details of ClustalW’s output

At the bottom of each column:

Chapter Five: Measurement of Genetic Expression

Gene expression: conversion of the information encoded in a gene, for coding genes, first into messenger RNA and then into protein.

Not every gene is always necessary for the cell life

Gene expression is regulated by the needs of the cell: environmental conditions and functions to be performed.

Gene expression differs depending on the cell type and the response to the environment.

The transcriptome is the complete set of gene transcripts and of their levels of expression, in a particular type of cells or tissue, in well defined conditions.

To understand biological organisms it is necessary to study:

Systems biology: study of interactions between components of a biological system and how such interactions induce functions and behaviour of the system.

For functional analysis of genomes:

High - throughput procedures.

These approaches of genotyping must be correlated with phenotypic analysis of model organisms and cells in vitro.

5.1 Gene expression analysis techniques

How can gene expression be measured?

Method to measure the expression level of one or a few genes at a time: RT-PCR (Reverse Transcription Polymerase Chain Reaction)

Main analysis techniques of the whole transcriptome:

1980: RNA analysis of one or few genes at a time

1995: RNA analysis whole genome

Two main technologies of DNA microarrays:

5.1.1 Northern Blot

Laboratory technique to study genetic expression, by finding the RNA (or isolated mRNA) in a sample

In 4 steps:

  1. RNA extraction: we extract the RNA from the sample.
  2. Preparation of the probe: a fragment of the gene that we want to analyze.
  3. Hybridization: we wait until the probe binds to the gene transcript, which happens only if the gene is expressed.
  4. Detection: the probe is labeled with a marker (radioactive or fluorescent), and its intensity is proportional to the quantity of expressed gene.

5.1.2 RT-procedure

The polymerase chain reaction (PCR) is a laboratory technique that exploits DNA replication to amplify a single copy or a few copies of a specific DNA sequence, up to \cong 10kb long (in some cases up to 40kb).

PCR is based on thermal cycles of heating and cooling of the solution where the DNA replication reaction occurs: a high temperature is used to separate the two strands of the DNA helix, and a lower temperature is used for the replication of the DNA.

The reverse transcription polymerase chain reaction (RT-PCR) is a variation of PCR in which an RNA strand is first reverse-transcribed into its complementary DNA (cDNA) by the enzyme reverse transcriptase; the resulting cDNA is then amplified by traditional PCR or real-time PCR, carried out in a thermal cycler that controls time and temperature automatically.

Another way to replicate pieces of DNA uses bacterial plasmids as vectors to clone DNA sequences: the DNA fragment to be cloned is inserted into the DNA sequence of the plasmid, and the DNA ligase enzyme binds the fragment to the plasmid DNA -> recombinant plasmid.

5.2 DNA microarrays

Microarrays: ordered and miniaturized arrangements of DNA fragments with known sequences on a solid support.

Microarrays are the evolution of the Northern blot: microarrays can analyze the entire genome, while the Northern blot can analyze only one or a few genes.

Application:

Since they make it possible to determine the expression profile of the cell in a given state, microarrays are also said to allow expression profiling.

5.2.1 cDNA microarrays

4 steps:

  1. Building of the cDNA microarray: a full collection of ESTs (Expressed Sequence Tags, short sub-sequences of a transcribed cDNA sequence) is spotted on the support.
  2. Sample preparation: two mRNA samples are prepared, reverse-transcribed into cDNA and made fluorescent with different colors (Cy3, green, used for the control; Cy5, red, used for the test).
  3. Hybridization: the gene transcripts expressed in the samples, once prepared and labeled, are hybridized with their complementary sequences on the microarray.
  4. Measurement of the gene expression: the fluorescence measured in each spot indicates which genes are expressed in each of the two samples.

Images of cDNA microarray:

From images to data: a laser scanner reads the intensity of each spot and converts it into data.

Pros and Cons of “spotted” technology:

5.2.2 Oligonucleotide microarrays

In place of the ESTs, there are oligonucleotides 20-80 bases long, designed to represent ORFs.

Composition of each set of sequences of oligonucleotides:

Example

PM: ATC

MM: AAC

For each probe with sequence of PM there is on chip another probe with sequence of MM.

Each gene is represented as a set of 10-20 oligonucleotides, corresponding to some positions of the represented gene, each with PM and MM.

So the probes are synthesized directly on the chip and not put mechanically as in the cDNA microarray.

Oligonucleotides are synthesized in situ on the silicon chip by lithography:

Advantages:

How can the expression intensity be read?

The labeled target (mRNA of the cell) is deposited on the chip, where it binds to the PM or MM sequences; software then measures the intensity of each bound probe.

Expression = avg[I(PM) - I(MM)]

For the score we use: R_i = \frac{I(PM)_i - I(MM)_i}{I(PM)_i + I(MM)_i}

Detection p-value: a hypothesis test is performed to check whether the score differs significantly from a threshold close to zero:

This type of test is used when the data do not follow a normal distribution: we compute the difference between I(PM) and I(MM) for each probe pair, order the differences by absolute value, and assign rank 1 to the first, rank 2 to the second, and so on.

Finally, we sum the ranks of the positive scores (W^+) and, separately, the ranks of the negative scores (W^-).

From this we find the min(W^+, W^ -) = 1

All possible sign combinations: N = 3 \implies 2^3 = 8

So

With the p-value we can decide whether the hybridization of a probe set is called present (P), absent (A) or marginal (M)
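The rank-sum procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the differences are hypothetical, there are no ties or zero differences, and the p-value is the exact two-sided one obtained by enumerating all 2^N sign combinations (the notes do not specify whether a one-sided or two-sided version is used).

```python
from itertools import product


def signed_rank_p_value(diffs):
    """Exact two-sided signed-rank p-value by enumerating all 2^N sign patterns.

    Assumes no zero differences and no ties in absolute value.
    """
    n = len(diffs)
    # Rank the differences by absolute value: rank 1 = smallest |difference|.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w_obs = min(w_plus, w_minus)

    # Under the null hypothesis every sign pattern is equally likely.
    total = sum(ranks)
    as_extreme = 0
    for signs in product((False, True), repeat=n):
        wp = sum(r for r, positive in zip(ranks, signs) if positive)
        if min(wp, total - wp) <= w_obs:
            as_extreme += 1
    return w_obs, as_extreme / 2 ** n


# Hypothetical I(PM) - I(MM) differences for N = 3 probe pairs.
w, p = signed_rank_p_value([-0.1, 0.5, 0.9])
print(w, p)
```

With N = 3 the enumeration visits 2^3 = 8 sign patterns, which is why so few probe pairs can never give a small p-value.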

5.2.3 cDNAs vs Oligonucleotides

Search of regulated transcripts

Comparative analysis allows to compare, for each represented transcript, the expression level of one condition to another, directly with the probe set expression level.

In this way, it is possible to identify and to qualify accurately alterations at transcriptional level between two samples.

5.2.4 Microarray and Gene Expression Data (MGED)

MGED team

Team’s goal is to simplify:

Glossary

MIAME

Principles:

ArrayExpress

5.3 Data analysis of expression data

Issues:

5.4 DNA Microarray

Microarray: microscope slides or chips that contain ordered series of probes.

There are many types of microarray, based on:

We focus on DNA microarrays -> expression profiling.

Goal: study the effect of treatments, diseases, etc. on gene expression.

Used also to analyze gene sequence in a sample.

cDNA microarrays are based on two channels

They are hybridized with cDNA from two samples to be compared (e.g. cancer cells vs healthy cells), labeled with two different dyes:

Oligonucleotide microarrays are based on a single channel

5.4.1 Normalization

Normalization: process of removing systematic variations that affect measured gene expression levels in microarray experiments.

Sources of systematic variation:

The expression ratio of the i-th gene on all arrays used is defined as:

\begin{align*} T_i = \frac{R_i}{G_i} ,\ i = 1,...,N_{\text{genes}} \end{align*}

Issue: expression ratios treat up-regulated and down-regulated genes differently:

Solution: log_2(ratio) = log_2(T_i) = log_2 (\frac{R_i}{G_i}) \implies symmetric distribution

Assumptions:

Normalization

\begin{align*} R_i' = R_i \\ G_i' = K_{global}*G_i \\ K_{global} = \frac{\sum_{i=1}^{N_{array}}R_i}{\sum_{i=1}^{N_{array}}G_i} \\ log_2(T_i') = log_2(T_i) - log_2(K_{global}) \end{align*}

\begin{align*} M_i = log_2(R_i) - log_2(G_i) \\ A_i = \frac{log_2(R_i) + log_2(G_i)}{2} \end{align*}
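As a minimal sketch of the global normalization above, the following Python function scales the green channel by K_{global} and then computes M and A on the normalized intensities. The function name and the intensity values are hypothetical, and all intensities are assumed to be strictly positive.

```python
import math


def global_normalize(red, green):
    """Scale the green channel by K_global, then return (K_global, M, A)."""
    k_global = sum(red) / sum(green)
    green_scaled = [k_global * g for g in green]          # G'_i = K_global * G_i
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green_scaled)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green_scaled)]
    return k_global, m, a


# Hypothetical intensities: the red channel is uniformly twice as bright.
k, m_values, a_values = global_normalize([200, 400, 800], [100, 200, 400])
print(k, m_values)
```

After the correction the log ratios of this toy array are all zero, which is what the assumption "most genes are not differentially expressed" would lead us to expect for a pure dye-intensity bias.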

5.4.1.1 LOWESS Normalization

LOWESS(LOcally WEighted Scatterplot Smoothing): Locally Weighted Linear Regression

For each A value, we calculate the regression line on the basis of a subset of points (M,A) around such A value (M(A)).

LOWESS correction: M_i' = log_2(T_i') = log_2(T_i) - M(A_i) = M_i - M(A_i)

Can be made equivalent to a transformation on the intensities.

R_i' = R_i \\ G_i' = G_i * 2^{M(A_i)}
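The LOWESS correction can be illustrated with a simplified pure-Python sketch. It is not a full LOWESS implementation: it uses tricube weights and one local linear fit per point, but omits the robustness iterations of the complete algorithm; the `frac` parameter (fraction of points used in each local fit) and the function names are assumptions of this example.

```python
import math


def lowess_fit(a_values, m_values, frac=0.5):
    """Locally weighted linear regression of M on A, evaluated at each A_i."""
    n = len(a_values)
    k = min(n, max(3, math.ceil(frac * n)))  # points used in each local fit
    fitted = []
    for a0 in a_values:
        # k nearest neighbours of a0 along the A axis
        nearest = sorted(range(n), key=lambda i: abs(a_values[i] - a0))[:k]
        h = max(abs(a_values[i] - a0) for i in nearest) or 1.0
        # tricube weights: 1 at a0, falling to 0 at distance h
        w = [(1 - (abs(a_values[i] - a0) / h) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        swx = sum(wi * a_values[i] for wi, i in zip(w, nearest))
        swy = sum(wi * m_values[i] for wi, i in zip(w, nearest))
        swxx = sum(wi * a_values[i] ** 2 for wi, i in zip(w, nearest))
        swxy = sum(wi * a_values[i] * m_values[i] for wi, i in zip(w, nearest))
        denom = sw * swxx - swx ** 2
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate window: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * a0)
    return fitted


def lowess_normalize(a_values, m_values, frac=0.5):
    """M'_i = M_i - M(A_i)."""
    fit = lowess_fit(a_values, m_values, frac)
    return [m - f for m, f in zip(m_values, fit)]


# Hypothetical data with a purely intensity-dependent dye bias M = 0.5*A - 3.
a = [6, 7, 8, 9, 10, 11, 12]
m = [0.5 * x - 3 for x in a]
print(lowess_normalize(a, m))
```

In this toy case the bias is exactly linear in A, so the local fits recover it and the corrected values are (numerically) zero.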

5.4.1.2 Local Normalization

Most normalization algorithms can be applied either globally or locally

For spotted arrays, local normalization is often applied to each group of array elements deposited by a single spotting pen.

Local normalization can help correcting for systematic spatial variation

Let M_j(A) be the LOWESS fit for the sub-array j :

\begin{align*} M_i' = log_2(T_i') = log_2(T_i) - M_j(A_i) = M_i - M_j(A_i) \end{align*}

5.4.1.3 Variance regulation

The previous normalizations adjust the mean of the log_2(ratio) measurements, but the variance of the measured log_2(ratio) values might still differ from one array region to another.

Let \sigma_j^2 denote the variance of the normalized log_2(ratio) values in the j-th sub-array:

\begin{align*} \sigma_j^2 = \frac{1}{N_j} \sum_{i=1}^{N_j} (M_i')^2 \\ M_i' = M_i - M_j(A_i) \end{align*}

The scaling factor is: a_j = \frac{\sigma_j^2}{(\prod_{k=1}^{N_{sub-array}}\sigma_k^2)^{\frac{1}{N_{sub-array}}}}

Then we obtain: M_i'' = \frac{M_i'}{a_j} = \frac{log_2(T_i')}{a_j} and, since T_i' = \frac{R_i'}{G_i'}, M_i'' = \frac{log_2(\frac{R_i'}{G_i'})}{a_j}

A robust alternative to measure the sub-array dispersion is the MAD (Median Absolute Deviation): MAD_j = median_j(|M_i - median_j(M_i)|)
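A small Python sketch of the variance regulation step, under the assumption that the input is one list of already normalized log_2(ratio) values M_i' per sub-array (names and numbers are hypothetical). The scaling factor is the sub-array variance divided by the geometric mean of all sub-array variances, as in the formula above.

```python
import math
from statistics import median


def variance_scale_factors(sub_arrays):
    """a_j = sigma_j^2 / geometric mean of all sigma_k^2."""
    variances = [sum(m * m for m in sub) / len(sub) for sub in sub_arrays]
    geo_mean = math.exp(sum(math.log(v) for v in variances) / len(variances))
    return [v / geo_mean for v in variances]


def rescale(sub_arrays):
    """M''_i = M'_i / a_j for every element i of sub-array j."""
    factors = variance_scale_factors(sub_arrays)
    return [[m / a for m in sub] for sub, a in zip(sub_arrays, factors)]


def mad(values):
    """Median absolute deviation of a list of values."""
    med = median(values)
    return median([abs(v - med) for v in values])


# Two hypothetical sub-arrays with variances 1 and 4.
subs = [[1.0, -1.0], [2.0, -2.0]]
print(variance_scale_factors(subs))
print(mad([1, 2, 3, 4, 100]))
```

The last line shows why the MAD is more robust than the variance: the outlier 100 does not change it.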

5.4.1.4 Between array normalization

The particular approach used for between array normalization depends on the chosen experiment design.

We consider two experiment design choices:

Dye - reversal analysis

We assume to have two samples, A and B

  1. A \to R, B \to G
  2. A \to G, B \to R

T_{1,i}' = \frac{R_{1,i}'}{G_{1,i}'} = \frac{A_i}{B_i}

T_{2,i}' = \frac{B_i}{A_i}

The two arrays compare the same pair of samples with the dyes swapped, so for each gene we expect:

log_2(T_{1,i}' * T_{2,i}') = log_2(\frac{A_i}{B_i} * \frac{B_i}{A_i}) = 0

Replicate averaging

We assume to have more than one replicate for the same experiment:

\begin{align*} M_{k,i}' = log_2(T_{k,i}') = log_2 (\frac{R_{k,i}'}{G_{k,i}'}), k = 1,...,N_{replicates} \\ A_{k,i}' = \frac{1}{2} log_2(R_{k,i}' * G_{k,i}') \end{align*}

The simplest strategy is to average the M and A values:

\overline{M_i} = \frac{1}{N_{replicates}} \sum_{k=1}^{N_{replicates}}M_{k,i}' \\ \overline{A_i} = \frac{1}{N_{replicates}} \sum_{k=1}^{N_{replicates}}A_{k,i}'

Averaging the log values is equivalent to taking the geometric average, over the replicates, of the raw measurements R and G.
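The link with the geometric average can be made explicit with a short derivation (using only the definitions of M_{k,i}' and T_{k,i}' given above):

\begin{align*} \overline{M_i} = \frac{1}{N_{replicates}} \sum_{k=1}^{N_{replicates}} log_2(T_{k,i}') = log_2 \left( \left( \prod_{k=1}^{N_{replicates}} T_{k,i}' \right)^{\frac{1}{N_{replicates}}} \right) \end{align*}

so 2^{\overline{M_i}} is the geometric average of the replicate ratios T_{k,i}'; the same argument applied to A_{k,i}' gives the geometric average of the products R_{k,i}' * G_{k,i}'.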

5.4.2 Detection of differential expression

Goal: identification of genes that are significantly differentially expressed between one or more pairs of samples in the data set, after data normalization.

Strategies:

Goals:

Sample mean: \overline{M_i} = \frac{1}{N_{replicates}} \sum_{k=1}^{N_{replicates}}M_{k,i}'

Sample variance: s_i^2 = \frac{1}{N_{replicates} - 1} \sum_{k=1}^{N_{replicates}}(M_{k,i}' - \overline{M_i})^2

Two hypotheses:

Distribution under the null hypothesis

We assume that the normalized log_2(ratio) measurements are zero-mean Gaussian distributed with unknown variance \sigma_i^2:

M_{k,i}' \sim N(0,\sigma_i^2)

t-statistic: t_i = \frac{\overline{M_i}}{\frac{s_i}{\sqrt{N_{replicates}}}}, s_i: standard deviation

The t-statistic is used to compare a sample mean to a specific value \mu_0 (independent one-sample t-statistic)

t = \frac{\overline{x} - \mu_0}{\frac{s}{\sqrt{N}}}, N: sample number

If the population is normally distributed, under the null hypothesis the t-statistic follows a Student's t distribution with N-1 degrees of freedom (dof).

If the true standard deviation \sigma of the population were known in advance, we would use the z-statistic:

\begin{align*} z = \frac{\overline{x} - \mu_0}{\frac{\sigma}{\sqrt{N}}} \ \text{normally distributed} N(0,1) \end{align*}

The p-value represents the probability of observing, under the null hypothesis, a value at least as extreme as the observed t-statistic.

Let F(t) = \int_{- \infty}^t f(u)du denote the cumulative distribution function of the t-statistic under the null hypothesis, where f is the corresponding probability density function. The p-value of the i-th gene is given by: p_i = 2 * (1-F(|t_i|)). The factor 2 appears because we don’t distinguish between up- and down-regulated genes.
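The one-sample t-test for a single gene can be sketched with the Python standard library only. To avoid external dependencies, this example evaluates F(t) by numerically integrating the Student's t density with Simpson's rule; that numerical shortcut, the function names and the data are assumptions of this sketch, and a statistics library would normally be used instead.

```python
import math
from statistics import mean, stdev


def t_pdf(u, dof):
    """Density of the Student's t distribution with `dof` degrees of freedom."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return c * (1 + u * u / dof) ** (-(dof + 1) / 2)


def t_cdf(t, dof, steps=2000):
    """F(t) via Simpson's rule on [0, |t|], using the symmetry of the density."""
    x = abs(t)
    h = x / steps
    area = t_pdf(0, dof) + t_pdf(x, dof)
    for j in range(1, steps):
        area += (4 if j % 2 else 2) * t_pdf(j * h, dof)
    area *= h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


def one_sample_t_test(log_ratios):
    """t-statistic against mu_0 = 0 and two-sided p-value for one gene."""
    n = len(log_ratios)
    t = mean(log_ratios) / (stdev(log_ratios) / math.sqrt(n))  # stdev uses N-1
    p = 2 * (1 - t_cdf(abs(t), n - 1))
    return t, p


# Hypothetical normalized log2 ratios of one gene in two replicates.
t_stat, p_value = one_sample_t_test([0.0, 2.0])
print(t_stat, p_value)
```

With only two replicates (1 degree of freedom) a t-statistic of 1 corresponds to a p-value of about 0.5, which illustrates how little evidence a couple of replicates can provide.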

The p-value is also called the observed significance level, and it is compared with the chosen significance level \alpha:

\alpha defines the actual false positive rate only when testing the differential expression of one gene at a time. When testing multiple genes simultaneously, as it’s usually the case, multiple test correction is needed.

There are many methods to adjust the p-value \implies p^{adj}-value; they correct the value in different ways:
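As an illustration of what an adjusted p-value looks like, here is a sketch of two widely used corrections, Bonferroni and Benjamini-Hochberg. They are chosen here only as common examples (an assumption: they are not necessarily the methods intended by these notes), and the p-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (control of the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni is the more conservative of the two: every adjusted value is at least as large as the corresponding Benjamini-Hochberg one.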

The t-statistic is not ideal because a small standard deviation s_i can drive the t-statistic to a large value even when the mean log ratio is small.

Alternative statistics:

Experimental design of genetic expression studies:

5.5 Machine Learning

Input Data: a set of microarray experiments with many variables (genes).

For each gene, a feature vector is formed by combining its normalized expression values in the available samples.

2 types of learning:

If the class variable has continuous instead of discrete values, we use regression analysis instead of classification.

5.5.1 Definitions

Distances

All machine learning (ML) approaches rely on some measure of distance between samples.

We must be aware of the distance function being used.

The choice of distance is IMPORTANT \implies it influences the outcome.

A distance measure d_{ij} between two vectors, i and j, must obey several rules:

Distances are represented as a matrix, where the element (i,j) (row, column) is the distance between sample i and sample j.

They are symmetric.

d_{ij}^E = \sqrt{\sum_{k=1}^N (x_i^k - x_j^k)^2}

d_{ij}^c = \sum_{k=1}^N \frac{(x_i^k - \mu_i)(x_j^k - \mu_j)}{\sqrt{\sum_{k=1}^N(x_i^k - \mu_i)^2} \sqrt{\sum_{k=1}^N (x_j^k - \mu_j)^2}}

Correlation measures linear association and is not robust (one outlier can ruin it)

d_{ij}^M = \sum_{k=1}^N |x_i^k - x_j^k|
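The three measures above (d^E, d^c and d^M) translate directly into Python. The function names are chosen for this sketch; note that d^c as written is the Pearson correlation coefficient, i.e. a similarity in [-1, 1] rather than a distance.

```python
import math


def euclidean(x, y):
    """d^E: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """d^M: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """d^c: Pearson correlation between the two vectors."""
    mu_x = sum(x) / len(x)
    mu_y = sum(y) / len(y)
    num = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mu_x) ** 2 for a in x)) * \
        math.sqrt(sum((b - mu_y) ** 2 for b in y))
    return num / den


print(euclidean([0, 0], [3, 4]), manhattan([0, 0], [3, 4]))
print(pearson([1, 2, 3], [2, 4, 6]))
```

The last call shows why the choice of measure matters: the two vectors are far apart in Euclidean terms but perfectly correlated.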

5.5.2 Unsupervised learning (UL)

Also known as clustering or class discovery; the idea is to determine how many groups are in the data and which variables seem to define the grouping.

Clustering algorithms are methods to divide a set of n observations into g groups so that within-group similarities are larger than between-group similarities.

The number of groups, g, is generally unknown

No training sample

Cross-validation difficult.

5.5.2.1 UL Models - Hierarchical clustering

Two types:

It must be defined:

Agglomerative hierarchical clustering

Input: one feature vector for each gene

  1. Initialization: each cluster consists of a gene.
  2. Compute the distance between each pair of clusters.
  3. Merge the two clusters with the smallest inter-cluster distance.
  4. Go to step 2, until all genes are contained within one big cluster.

Output: dendrogram
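The four steps above can be sketched as follows. The inter-cluster distance has to be chosen; this example assumes Euclidean distance with single linkage (distance between the two closest members), which is only one of the possible choices. The merge history it returns is the information a dendrogram displays.

```python
import math


def agglomerative(vectors):
    """Return the merge history [(cluster_a, cluster_b, distance), ...]."""
    # Step 1: each cluster consists of one gene (identified by its index).
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        # Step 2: distance between each pair of clusters (single linkage).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Step 3: merge the two closest clusters.
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        # Step 4: repeat until one big cluster remains.
    return merges


# Hypothetical one-dimensional feature vectors for four genes.
history = agglomerative([(0.0,), (1.0,), (5.0,), (5.5,)])
for left, right, dist in history:
    print(left, right, dist)
```

This naive version recomputes all pairwise distances at every iteration; it is meant to mirror the steps, not to be efficient.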

5.5.2.2 UL - Partitioning methods

Agglomerative cluster partitioning methods:

K-means

  1. Initialization:

    • Define the number of clusters k
    • Designate a cluster centre for each cluster
  2. Assign each data point to the closest cluster centre -> datapoint is now a member of the cluster.

  3. Calculate the new cluster centre (the centroid, i.e. the mean of the members of the cluster)

  4. Calculate the within-cluster sum of squared distances of the cluster elements from the cluster centroid; repeat steps 2-4 until the assignments no longer change.

“k=2”
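A minimal K-means sketch in Python, following the steps listed above. The initial centres are passed in explicitly (an assumption made to keep the example deterministic), the stopping rule is "assignments no longer change", and the points are hypothetical.

```python
import math


def k_means(points, centres, max_iter=100):
    """Return (labels, centres, within-cluster sum of squares)."""
    centres = [tuple(c) for c in centres]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the closest cluster centre.
        new_labels = [min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        # Recompute each centre as the mean of its members.
        for c in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wcss = sum(math.dist(p, centres[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centres, wcss


pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centres, wcss = k_means(pts, centres=[(0, 0), (10, 10)])
print(labels, centres, wcss)
```

Running the function from several different initial centres and keeping the solution with the smallest within-cluster sum of squares is a simple way to reduce the dependence on the initialization.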

A common problem: if the initial partitions are not chosen carefully enough, the computation may converge to a local minimum rather than to the global minimum solution.

A solution is fuzzy logic.

Fuzzy logic allows algorithm to accept the possibility that a single data point can belong to more than one cluster.

The Singular Value Decomposition (SVD), which is closely related to Principal Component Analysis (PCA), can be used to:

How does it work?

Consider the matrix A \in \R^{m \times n} where:

The SVD of A is defined as A = U \Sigma V^T where:

Linear algebra point of view:

“SVD”

Clustering two options:

5.5.3 Supervised Learning

There are several techniques:

The dimensionality is often huge, while the sample size is small.

5.5.3.1 SL-Techniques

Classification technique:

  1. Initialization
  2. Given a test sample:

    • Compute the distance between the test sample and all training samples
    • Retain the top k training samples sorted based on the distance from the test sample
    • Each neighbor votes for its label (given a tag): assign to the test sample the label that receives more votes.
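The voting procedure above (the k-nearest-neighbours idea) can be sketched as follows. Euclidean distance is assumed, and the training samples and labels are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train_x, train_y, sample, k=3):
    """Label `sample` by majority vote among its k closest training samples."""
    # Sort the training samples by distance from the test sample, keep the top k.
    neighbours = sorted(zip(train_x, train_y),
                        key=lambda pair: math.dist(pair[0], sample))[:k]
    # Each neighbour votes for its own label.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
train_y = ["control", "control", "control", "treated", "treated"]
print(knn_classify(train_x, train_y, (0.5, 0.5)))
print(knn_classify(train_x, train_y, (5, 6)))
```

Choosing an odd k avoids ties in a two-class problem; with ties, this sketch simply returns the label that was counted first.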

Support Vector Machines

There are two types:

5.5.3.2 Key aspects

Chapter Six: Bio-terminologies and bio-ontologies

Bio-terminologies and bio-ontologies have an important role in e-science.

Collection of terms, precise and universally comprehensible, that univocally define and identify different concepts.

Semantic structures used to:

Example

Bio-ontology issue

Ontology development is fragmented

In general there are too many annotations and too much confusion, without a real standard.

The US National Institutes of Health (NIH) has funded the National Center for Biomedical Ontology (NCBO)

Resources:

6.1 OBO

The OBO foundry is an open, inclusive and collaborative experiment involving developers of science-based ontologies aiming at:

In order to be part of OBO, an ontology must be:

There are several ontologies:

Structure:

Numerous different controlled vocabularies exist, and many vocabularies imply a lot of chaos, so the Unified Medical Language System (UMLS) was created and is maintained (by the US National Library of Medicine) to support the integration of biomedical textual annotations scattered across distinct databases.

6.2 Enrichment Analysis

Given a list of genes found relevant in a studied condition, we would like to understand why such genes are relevant in that condition.

We want to:

We can:

Goal: detect significant enrichments and/or depletions of annotation terms within a target set of genes of interest, with respect to a master set.

Problem statement

NO new annotation is generated with the enrichment analysis.

For each annotation term t_i:

Null hypothesis:

Under the null hypothesis, belonging to the target set B is independent from being annotated with term t

The probability of observing k genes in the target set annotated to the term t is given by the hypergeometric distribution:

Fisher Exact test is a test of significance used in place of Chi-square test in 2x2 tables, especially with small samples.

The probability P of obtaining a given contingency table purely by chance of sampling is:

P = \frac{n_T! * (n_A - n_T)! * n_B! * (n_A - n_B)!}{k! * (n_T - k)! * (n_B - k)! * (n_A - n_B - n_T + k)! * n_A!}
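The formula above is the hypergeometric probability written with factorials, and it can be evaluated with binomial coefficients. In this sketch the symbols are read as follows (an interpretation inferred from the formula, not stated explicitly in this section): n_A is the size of the master set, n_B the size of the target set, n_T the number of master-set genes annotated with the term t, and k the number of annotated genes in the target set. The one-sided enrichment p-value sums the probabilities of observing k or more annotated genes.

```python
from math import comb


def hypergeom_prob(k, n_a, n_b, n_t):
    """P(exactly k annotated genes in the target set)."""
    return comb(n_t, k) * comb(n_a - n_t, n_b - k) / comb(n_a, n_b)


def enrichment_p_value(k, n_a, n_b, n_t):
    """One-sided p-value: probability of k or more annotated genes in the target set."""
    upper = min(n_b, n_t)
    return sum(hypergeom_prob(j, n_a, n_b, n_t) for j in range(k, upper + 1))


# Hypothetical numbers: 10 genes in the master set, 4 in the target set,
# 3 annotated with the term, 2 of which fall in the target set.
print(hypergeom_prob(2, 10, 4, 3))
print(enrichment_p_value(2, 10, 4, 3))
```

Expanding the three binomial coefficients into factorials gives back exactly the expression for P shown above.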

6.2.1 Biological interpretation and multiple testing correction

The correction methods assume that the tests are independent, but this assumption does not hold when ontological annotations are used, because parent-child dependencies between annotation terms exist.

There are methods that exploit the ontology structure to de-correlate ontology terms:

Alexa et al

Analyze the ontology terms of the annotations bottom-up; two methods:

  1. Elim method:

    • For each level of the ontological hierarchy:
      • If a term is found to be significantly enriched, remove the annotations to its ancestor terms of the genes annotated to it from the target and master set.
  2. Weight method:

    • Improve the first method, NO elimination but assign soft weight

Grossman et al

Goal: avoid the inheritance problem -> if the parent term is enriched, its children tend to be enriched too

Limitations

6.3 Gene Set Enrichment Analysis (GSEA)

Computational method that compares a ranked list of genes L to gene sets/signatures defined on prior biological knowledge, to determine whether a gene set shows statistically significant, concordant differences between two biological states.

All these annotated gene sets are hosted in the Molecular Signatures Database (MSigDB)

Genes from an expression dataset are ordered in a ranked list L.

Given an a priori defined gene set S, GSEA determines whether the members of S are randomly distributed throughout the list L or occur primarily toward the top or bottom.

An enrichment score indicates if the genes in set S are clustered toward the beginning or end of the ranked list L

6.3.1 How does it work?

3 steps:

  1. Calculation of the Enrichment Score ES(S)
    • ES(S) reflects the degree a set S is overrepresented at the extremes of the entire ranked list L
    • The enrichment score is the maximum deviation from zero in the random walk.
      • Rank the genes in the dataset by the correlation of their expression profiles with class C, forming L=\{g_1,...,g_N\}; consider an independent gene set S containing N_H genes.
      • Up to a given position i in L, evaluate the fraction of genes in S (hits), weighted by their correlation r_j, and the fraction of genes not in S (misses):
      • P_{hit}(S,i) = \sum_{g_j \in S, j \leq i} \frac{|r_j|^P}{N_R}, N_R = \sum_{g_j \in S} |r_j|^P
      • P_{miss}(S,i) = \sum_{g_j \notin S, j \leq i}\frac{1}{N - N_H}
    • The ES(S) is the maximum deviation from zero of P_{hit} - P_{miss}
  2. Estimation of significance level of ES
    • Statistical significance (p-value) of observed ES is estimated by using a permutation test:
      • Phenotype/class label permutations: randomly assign phenotype labels to samples, reorder genes, and re-compute ES(S)
      • Repeat previous step to perform 1000 permutations to generate a null distribution and create a histogram of the corresponding enrichment score ES \ NULL
      • Estimate empirical significance of observed ES(S) relative to null distribution of ES \ NULL scores, using the positive or negative portion of the distribution based on sign of observed ES(S)
  3. Adjustment for multiple hypothesis testing
    • When many gene sets are tested, the significance levels must be adjusted as well:
      • For each gene set S', compute ES(S')
      • For each S' and each permutation \pi, compute ES(S',\pi)
      • Normalize with Normalized Enrichment Score NES(S') and NES(S',\pi)
      • Compute the FDR of each NES
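Step 1 above (the running sum P_{hit} - P_{miss} and its maximum deviation from zero) can be sketched as follows. The gene names and correlations are hypothetical, the exponent written P in the formulas is the `p` parameter here, and the sketch assumes that S contains at least one gene of L but not all of them. The permutation test and the FDR of steps 2 and 3 are not included.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Maximum deviation from zero of P_hit - P_miss along the ranked list."""
    n = len(ranked_genes)
    n_h = sum(1 for g in ranked_genes if g in gene_set)
    n_r = sum(abs(r) ** p
              for g, r in zip(ranked_genes, correlations) if g in gene_set)
    p_hit = 0.0
    p_miss = 0.0
    es = 0.0
    for g, r in zip(ranked_genes, correlations):
        if g in gene_set:
            p_hit += abs(r) ** p / n_r      # weighted hit
        else:
            p_miss += 1.0 / (n - n_h)       # unweighted miss
        if abs(p_hit - p_miss) > abs(es):
            es = p_hit - p_miss             # keep the sign of the deviation
    return es


ranked = ["g1", "g2", "g3", "g4"]           # already ordered by correlation
corr = [2.0, 1.0, -1.0, -2.0]
print(enrichment_score(ranked, corr, {"g1", "g2"}))   # set at the top of L
print(enrichment_score(ranked, corr, {"g3"}))         # set toward the bottom
```

A positive score means the set is concentrated at the top of the list, a negative one that it is concentrated at the bottom.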

6.4 Functional similarity analysis

Computing functional similarity between genes

Goal: compute the functional similarity between genes, based on the annotations describing their functions

Traditional strategies are based on:

Issue: the majority of co-functioning genes:

Hypothesis: if two genes have similar functional annotation profiles, they should be functionally related -> Measure of functional similarity based on gene annotation profiles:

Functional similarity based on controlled vocabularies

Controlled vocabulary schemas mandate the use of predefined, authorized terms that have been preselected.

In their simplest version, there are no semantic links between the terms in the controlled vocabulary.

Annotation of genes:

Example tool: DAVID (the Database for Annotation, Visualization and Integrated Discovery)

It uses the Kappa (\kappa) statistical index.

Computing similarity based on annotation profile

Typically, with ontological annotations, it is a two step procedure:

  1. Compute ontological term-to-term similarity.
  2. Compute gene-to-gene similarity based on annotation profile

Optional step:

Step 1: term-to-term similarity method

6.4.1 Term-to-term similarity

5 steps:

  1. Compute the frequency (probability) of occurrence of a term in a corpus
    • Freq(c) = \sum \{occur(c_i) | c \in Ancestors(c_i)\}
    • Prob(c) = \frac{Freq(c)}{max(Freq)}
  2. Compute the information content Ic(c) of a term:
    • Ic(c) = - log(Prob(c))
    • The rarer the term: \downarrow Prob(c) and \uparrow Ic(c)
    • The more common the term: \uparrow Prob(c) and \downarrow Ic(c)
  3. Find common ancestors of two terms
    • CommonAnc(c_1,c_2) = Ancestors(c_1) \cap Ancestors(c_2)
  4. Compute shared information between two terms:
    • Share(c_1,c_2) = max\{Ic(a) | a \in CommonAnc(c_1, c_2)\}

    • i.e. the IC of the Lowest Common Ancestor (LCA)
  5. Compute similarity metrics between two terms:
    • Resnik
      • Sim_{Resnik}(c_1,c_2) = Share (c_1, c_2)
    • Jiang
      • dist_{JC}(c_1,c_2) = Ic(c_1) + Ic(c_2) - 2*Share(c_1,c_2)
    • Lin
      • Sim_{Lin}(c_1,c_2) = \frac{2*Share(c_1,c_2)}{Ic(c_1) + Ic(c_2)}
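The five steps can be tried on a toy example. Everything in the data below is hypothetical (the five-term ontology and the occurrence counts), and two assumptions are made: a term counts among its own ancestors, and the logarithm is taken in base 2 (the notes do not specify the base).

```python
import math

# Hypothetical toy ontology: term -> list of parent terms.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
# Hypothetical number of direct occurrences of each term in the corpus.
OCCUR = {"root": 0, "A": 1, "B": 4, "A1": 2, "A2": 1}


def ancestors(term):
    """The term itself plus every term reachable through parent links."""
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


# Step 1: frequency of a term = occurrences of the term and of its descendants.
FREQ = {c: sum(OCCUR[t] for t in PARENTS if c in ancestors(t)) for c in PARENTS}
MAX_FREQ = max(FREQ.values())
# Step 2: information content.
IC = {c: -math.log2(FREQ[c] / MAX_FREQ) for c in PARENTS}


def share(c1, c2):
    """Steps 3-4: IC of the most informative common ancestor."""
    return max(IC[a] for a in ancestors(c1) & ancestors(c2))


# Step 5: the three similarity/distance measures.
def sim_resnik(c1, c2):
    return share(c1, c2)


def dist_jc(c1, c2):
    return IC[c1] + IC[c2] - 2 * share(c1, c2)


def sim_lin(c1, c2):
    return 2 * share(c1, c2) / (IC[c1] + IC[c2])


print(sim_resnik("A1", "A2"), dist_jc("A1", "A2"), sim_lin("A1", "A2"))
print(sim_lin("A1", "B"))
```

Terms that only share the root get similarity 0, because the root has probability 1 and therefore no information content. Note that `sim_lin` is undefined (division by zero) when both terms are the root.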

6.4.2 gene-to-gene similarity

Step 2: gene-to-gene similarity:

How to validate?

Optional step

Chapter Seven: Biomolecular Databank

There are a lot of biomolecular data:

So \underline{\text{DataBank}}

2 types:

Example of primary databank

The three major databanks have joined the International Nucleotide Sequence Database Collaboration, which promotes these projects:

Specialized databanks collect sets of data that are homogeneous from the taxonomic and/or functional point of view, enriched with annotations and value-added information.

Example:

Can be classified as:

Databank access types:

Tools for using gene and protein annotations:

Chapter Eight: Biological Networks

Why ?

  1. Model complex interactions among distinct entities
  2. Integrate and visualize distinct data and information
  3. Investigate properties to better understand the underlying system.

Types of biological networks:

To build a gene co-expression network, we start from the gene expression data and apply a similarity measure.

Similarity metrics:

A network is a structure formed by N nodes and E edges

Can be:

Topological measures of networks:

C_i = \frac{\text{\# triangles of node}\ i}{k_i * \frac{k_i - 1}{2}}, k_i = \text{\# connections of}\ i

Example

C_a = \frac{3}{5 * \frac{5-1}{2}} = \frac{3}{10}
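The same computation can be sketched in Python on an adjacency-set representation of the network. The graph below is hypothetical: it is simply built so that node a has 5 connections and 3 triangles, reproducing the value 3/10 of the example.

```python
from itertools import combinations


def clustering_coefficient(adjacency, node):
    """Triangles through `node` divided by the number of possible neighbour pairs."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no pair of neighbours, so no possible triangle
    # A triangle exists for every pair of neighbours that are linked to each other.
    triangles = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return triangles / (k * (k - 1) / 2)


# Hypothetical undirected graph (each edge is listed in both directions).
graph = {
    "a": {"b", "c", "d", "e", "f"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
    "e": {"a", "f"},
    "f": {"a", "e"},
}
print(clustering_coefficient(graph, "a"))
```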

Degree centrality -> number of links \implies hubs are the most central nodes

Closeness centrality

Average distance from the node i to all the other n-1 network nodes

l_i = \frac{1}{n-1}\sum_j d_{ij}

closeness centrality: c_i = \frac{1}{l_i} = \frac{n-1}{\sum_j d_{ij}}
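For an unweighted network, the distances d_{ij} are shortest-path lengths and can be obtained with a breadth-first search. The following sketch assumes a connected, undirected graph stored as adjacency sets; the three-node path graph is hypothetical.

```python
from collections import deque


def closeness(adjacency, source):
    """c_i = (n - 1) / sum_j d_ij, with d_ij computed by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    n = len(adjacency)
    return (n - 1) / sum(dist.values())


# Hypothetical path graph a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(closeness(path, "b"), closeness(path, "a"))
```

The middle node of the path is the closest to all the others, so it gets the highest closeness centrality.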

betweenness centrality: b_i = \sum_{j,k} \frac{\text{\# of shortest paths connecting}\ j,k\ \text{via}\ i }{\text{\# of shortest paths connecting}\ j,k} = \sum_{j,k} \frac{n_{j,k}(i)}{n_{j,k}}

Weighted Gene Co-expression Network Analysis (WGCNA)

WGCNA is a popular systems biology strategy to explore the system-level functionality of a transcriptome:

  1. Construct a gene co-expression network, represented mathematically by a matrix whose elements indicate the co-expression similarity between a pair of genes.
  2. Identify modules using hierarchical clustering: WGCNA uses a topological overlap matrix and a dissimilarity measure to obtain modules, which can be biologically meaningful in real data analysis.
  3. Relate modules to phenotypic and/or clinically relevant traits:
    • One can test the module-trait association between the module eigengene and the trait.
    • One can also use the module significance (MS), which is defined as the average gene significance (GS) to a trait of all genes in the module. The GS of a node is the correlation between the node and the trait.
    • Lastly, the module membership of a gene i (MM(i) = cor(x_i, ME)) measures the importance of the gene within the module.
  4. Study inter-module relationships and module preservation.
  5. Find key drivers in interesting modules.
Written by: Niccolò Papini